A diversity index is a method of measuring how many different types (e.g. species) there are in a dataset (e.g. a community). Diversity indices are statistical representations of different aspects of biodiversity (e.g. species richness, species evenness, and dominance), which are useful simplifications for comparing different communities or sites.
When diversity indices are used in ecology, the types of interest are usually species, but they can also be other categories, such as genera, families, or functional types. The entities of interest are usually individual organisms (e.g. plants or animals), and the measure of abundance can be, for example, number of individuals, biomass or coverage. In demography, the entities of interest can be people, and the types of interest various demographic groups. In information science, the entities can be characters and the types the different letters of the alphabet. The most commonly used diversity indices are simple transformations of the effective number of types (also known as 'true diversity'), but each diversity index can also be interpreted in its own right as a measure corresponding to some real phenomenon (a different one for each diversity index).
Many indices account only for categorical diversity between subjects or entities. Such indices, however, do not account for the total variation (diversity) that can exist between subjects or entities, which is captured only when both categorical and qualitative diversity are calculated.
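Diversity indices described in this article include the effective number of types (Hill numbers), the Shannon index, the Rényi entropy, the Simpson index, and the inverse Simpson index.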
Some more sophisticated indices also account for the phylogenetic relatedness among the types. These are called phylo-divergence indices, and are not yet described in this article.
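The true diversity, or effective number of types, of order $q$ is calculated with the following equation:
$${}^qD = \frac{1}{M_{q-1}} = \frac{1}{\sqrt[q-1]{\sum_{i=1}^{R} p_i p_i^{q-1}}} = \left(\sum_{i=1}^{R} p_i^{q}\right)^{1/(1-q)}$$
The denominator $M_{q-1}$ equals the average proportional abundance of the types in the dataset as calculated with the weighted generalized mean with exponent $q-1$. In the equation, $R$ is richness (the total number of types in the dataset), and the proportional abundance of the $i$th type is $p_i$. The proportional abundances themselves are used as the nominal weights. The numbers ${}^qD$ are called Hill numbers of order $q$ or effective number of species.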
When $q = 1$, the above equation is undefined. However, the mathematical limit as $q$ approaches 1 is well defined, and the corresponding diversity is calculated with the following equation:
$${}^1D = \frac{1}{\prod_{i=1}^{R} p_i^{p_i}} = \exp\left(-\sum_{i=1}^{R} p_i \ln(p_i)\right)$$
which is the exponential of the Shannon entropy calculated with natural logarithms (described below). In other domains, this statistic is also known as the perplexity.
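The general formula and its limit at order 1 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `hill_number` and the example community are assumptions made here, and the input is assumed to be a list of proportional abundances that sum to 1.

```python
import math

def hill_number(proportions, q):
    """Effective number of types (Hill number) of order q.

    `proportions` is assumed to hold proportional abundances summing to 1.
    """
    p = [x for x in proportions if x > 0]  # absent types contribute nothing
    if math.isclose(q, 1.0):
        # Limit as q -> 1: exponential of the Shannon entropy (natural logs)
        return math.exp(-sum(x * math.log(x) for x in p))
    basic_sum = sum(x ** q for x in p)
    return basic_sum ** (1.0 / (1.0 - q))

community = [0.5, 0.3, 0.2]                      # hypothetical abundances
d_limit = hill_number(community, 1.0)            # uses the limit formula
d_near = hill_number(community, 1.0 + 1e-6)      # general formula near q = 1
```

Evaluating the general formula just above order 1 should give nearly the same value as the limit formula, which is one way to check that the special case is handled consistently.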
The general equation of diversity is often written in the form
$${}^qD = \left(\sum_{i=1}^{R} p_i^{q}\right)^{1/(1-q)}$$
and the term inside the parentheses is called the basic sum. Some popular diversity indices correspond to the basic sum as calculated with different values of $q$.
Generally, increasing the value of $q$ increases the effective weight given to the most abundant species. This leads to a larger $M_{q-1}$ value and a smaller true diversity (${}^qD$) value with increasing $q$.
When $q = 1$, the weighted geometric mean of the $p_i$ values is used, and each species is exactly weighted by its proportional abundance (in the weighted geometric mean, the weights are the exponents). When $q > 1$, the weight given to abundant species is exaggerated, and when $q < 1$, the weight given to rare species is exaggerated instead. At $q = 0$, the species weights exactly cancel out the species proportional abundances, such that the weighted mean of the $p_i$ values equals $1/R$ even when the species are not all equally abundant. At $q = 0$, the effective number of species, ${}^0D$, hence equals the actual number of species $R$. In the context of diversity, $q$ is generally limited to non-negative values. This is because negative values of $q$ would give rare species so much more weight than abundant ones that ${}^qD$ would exceed $R$.
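The effect of the order on the average proportional abundance can be illustrated numerically. The sketch below computes the weighted generalized mean of the proportional abundances, with the abundances themselves as weights, for a few orders. The function name `mean_abundance` and the example values are assumptions for illustration only.

```python
import math

def mean_abundance(proportions, q):
    """Weighted generalized mean (exponent q - 1) of the proportional
    abundances, using the abundances themselves as weights."""
    p = [x for x in proportions if x > 0]
    if math.isclose(q, 1.0):
        # Exponent 0: weighted geometric mean, i.e. the product of p_i ** p_i
        return math.exp(sum(x * math.log(x) for x in p))
    exponent = q - 1.0
    return sum(x * x ** exponent for x in p) ** (1.0 / exponent)

community = [0.7, 0.2, 0.1]   # hypothetical, deliberately uneven
means = [mean_abundance(community, q) for q in (0, 1, 2, 3)]
```

For this uneven community the mean at order 0 is exactly one third (one over the number of types), and the mean grows as the order increases, so its reciprocal, the true diversity, shrinks.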
The Shannon index, also known as the Shannon entropy, is most often calculated as follows:
$$H' = -\sum_{i=1}^{R} p_i \ln p_i$$
where $p_i$ is the proportion of characters belonging to the $i$th type of letter in the string of interest. In ecology, $p_i$ is often the proportion of individuals belonging to the $i$th species in the dataset of interest. Then the Shannon entropy quantifies the uncertainty in predicting the species identity of an individual that is taken at random from the dataset.
Although the equation is here written with natural logarithms, the base of the logarithm used when calculating the Shannon entropy can be chosen freely. Shannon himself discussed logarithm bases 2, 10 and $e$, and these have since become the most popular bases in applications that use the Shannon entropy. Each log base corresponds to a different measurement unit, which has been called binary digits (bits), decimal digits (decits), and natural digits (nats) for the bases 2, 10 and $e$, respectively. Comparing Shannon entropy values that were originally calculated with different log bases requires converting them to the same log base: change from the base $a$ to base $b$ is obtained with multiplication by $\log_b a$.
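The base conversion can be checked with a short example. The sketch below is an illustration under simple assumptions: `shannon_entropy` is a name chosen here, and the input is a list of proportional abundances summing to 1. It computes the entropy in nats and in bits, then converts the value in nats to bits by multiplying by the logarithm of $e$ in base 2.

```python
import math

def shannon_entropy(proportions, base=math.e):
    """Shannon entropy of proportional abundances in the given log base."""
    return -sum(p * math.log(p, base) for p in proportions if p > 0)

p = [0.5, 0.25, 0.25]                  # hypothetical letter frequencies
h_nats = shannon_entropy(p)            # natural logarithms (nats)
h_bits = shannon_entropy(p, base=2)    # base-2 logarithms (bits)

# Change from base e to base 2: multiply by log_2(e)
h_bits_converted = h_nats * math.log(math.e, 2)
```

Both routes should agree up to floating-point rounding, which is why values reported in different units have to be converted before they are compared.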
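The Shannon index ($H'$) is related to the weighted geometric mean of the proportional abundances of the types. Specifically, it equals the logarithm of true diversity as calculated with $q = 1$:
$$H' = -\sum_{i=1}^{R} p_i \ln p_i = -\sum_{i=1}^{R} \ln p_i^{p_i}$$
This can also be written
$$H' = -\left(\ln p_1^{p_1} + \ln p_2^{p_2} + \ln p_3^{p_3} + \cdots + \ln p_R^{p_R}\right)$$
which equals
$$H' = -\ln\left(p_1^{p_1} p_2^{p_2} p_3^{p_3} \cdots p_R^{p_R}\right) = \ln\left(\frac{1}{p_1^{p_1} p_2^{p_2} p_3^{p_3} \cdots p_R^{p_R}}\right) = \ln\left(\frac{1}{\prod_{i=1}^{R} p_i^{p_i}}\right)$$
Since the sum of the $p_i$ values equals 1 by definition, the denominator equals the weighted geometric mean of the $p_i$ values, with the $p_i$ values themselves being used as the weights (exponents in the equation). The term within the parentheses hence equals true diversity ${}^1D$, and $H'$ equals $\ln({}^1D)$.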
When all types in the dataset of interest are equally common, all $p_i$ values equal $1/R$, and the Shannon index hence takes the value $\ln(R)$. The more unequal the abundances of the types, the larger the weighted geometric mean of the $p_i$ values, and the smaller the corresponding Shannon entropy. If practically all abundance is concentrated in one type, and the other types are very rare (even if there are many of them), Shannon entropy approaches zero. When there is only one type in the dataset, Shannon entropy exactly equals zero (there is no uncertainty in predicting the type of the next randomly chosen entity).
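The relationship between the Shannon index and the weighted geometric mean, along with the limiting cases just described, can be verified numerically. The following is a minimal sketch; the function names and the example communities are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math

def shannon_index(proportions):
    """Shannon index with natural logarithms."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

def weighted_geometric_mean(proportions):
    """Product of p_i ** p_i: the abundances act as their own weights."""
    return math.prod(p ** p for p in proportions if p > 0)

uneven = [0.6, 0.3, 0.1]
h_uneven = shannon_index(uneven)
diversity_order_1 = 1.0 / weighted_geometric_mean(uneven)

h_even = shannon_index([0.25] * 4)                          # four equal types
h_dominated = shannon_index([0.997, 0.001, 0.001, 0.001])   # one dominant type
h_single = shannon_index([1.0])                             # only one type
```

Here `h_uneven` matches the natural logarithm of `diversity_order_1`, `h_even` matches the logarithm of 4, `h_dominated` is close to zero even though four types are present, and `h_single` is zero.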
In machine learning, the Shannon index is usually called entropy; the related quantity known as information gain is the reduction in this entropy.
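The Rényi entropy generalizes the Shannon entropy to values of $q$ other than 1. It can be expressed as
$${}^qH = \frac{1}{1-q} \ln\left(\sum_{i=1}^{R} p_i^{q}\right)$$
which equals
$${}^qH = \ln\left(\frac{1}{\sqrt[q-1]{\sum_{i=1}^{R} p_i p_i^{q-1}}}\right) = \ln({}^qD)$$
This means that taking the logarithm of true diversity based on any value of $q$ gives the Rényi entropy corresponding to the same value of $q$.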
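The identity between the Rényi entropy and the logarithm of true diversity can be checked for several orders at once. The sketch below is illustrative only: both function names are assumptions, and `true_diversity` deliberately leaves out the special case of order 1.

```python
import math

def renyi_entropy(proportions, q):
    """Rényi entropy of order q, using natural logarithms."""
    p = [x for x in proportions if x > 0]
    if math.isclose(q, 1.0):
        # The limit q -> 1 is the Shannon entropy
        return -sum(x * math.log(x) for x in p)
    return math.log(sum(x ** q for x in p)) / (1.0 - q)

def true_diversity(proportions, q):
    """Effective number of types of order q (q != 1 in this sketch)."""
    p = [x for x in proportions if x > 0]
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

community = [0.5, 0.3, 0.2]   # hypothetical abundances
gaps = [
    abs(renyi_entropy(community, q) - math.log(true_diversity(community, q)))
    for q in (0, 0.5, 2, 3)
]
```

Every entry of `gaps` should be zero up to floating-point rounding.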
The Simpson index, λ, equals the probability that two entities taken at random from the dataset of interest represent the same type. It equals:
$$\lambda = \sum_{i=1}^{R} p_i^{2}$$
where $R$ is richness (the total number of types in the dataset). This equation is also equal to the weighted arithmetic mean of the proportional abundances $p_i$ of the types of interest, with the proportional abundances themselves being used as the weights. Proportional abundances are by definition constrained to values between zero and one, but λ is a weighted arithmetic mean of them, hence $\lambda \geq 1/R$, with equality reached when all types are equally abundant.
By comparing the equation used to calculate λ with the equations used to calculate true diversity, it can be seen that $1/\lambda$ equals ${}^2D$, i.e., true diversity as calculated with $q = 2$. The original Simpson's index hence equals the corresponding basic sum.
The interpretation of λ as the probability that two entities taken at random from the dataset of interest represent the same type assumes that the entities are sampled with replacement. If the dataset is very large, sampling without replacement gives approximately the same result, but in small datasets, the difference can be substantial. If the dataset is small, and sampling without replacement is assumed, the probability of obtaining the same type with both random draws is:
$$\ell = \frac{\sum_{i=1}^{R} n_i (n_i - 1)}{N (N - 1)}$$
where $n_i$ is the number of entities belonging to the $i$th type and $N$ is the total number of entities in the dataset. This form of the Simpson index is also known as the Hunter–Gaston index in microbiology.
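The difference between the two sampling assumptions is easy to see on count data. The sketch below uses hypothetical counts per type and function names chosen for illustration. The without-replacement form assumes at least two entities in total.

```python
def simpson_with_replacement(counts):
    """Sum of squared proportional abundances."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def simpson_without_replacement(counts):
    """Probability that two distinct draws share a type (needs total >= 2)."""
    total = sum(counts)
    return sum(c * (c - 1) for c in counts) / (total * (total - 1))

small = [3, 2, 1]            # hypothetical counts, 6 entities
large = [3000, 2000, 1000]   # same proportions, 6000 entities

small_gap = simpson_with_replacement(small) - simpson_without_replacement(small)
large_gap = simpson_with_replacement(large) - simpson_without_replacement(large)
```

With only six entities the two values differ by more than 0.1, while with six thousand entities in the same proportions the gap is on the order of 0.0001.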
Since the mean proportional abundance of the types increases with decreasing number of types and increasing abundance of the most abundant type, λ obtains small values in datasets of high diversity and large values in datasets of low diversity. This is counterintuitive behavior for a diversity index, so transformations of λ that increase with increasing diversity have often been used instead. The most popular of such indices have been the inverse Simpson index (1/λ) and the Gini–Simpson index (1 − λ). Both of these have also been called the Simpson index in the ecological literature, so care is needed to avoid accidentally comparing the different indices as if they were the same.
The inverse Simpson index equals:
$$\frac{1}{\lambda} = \frac{1}{\sum_{i=1}^{R} p_i^{2}} = {}^2D$$
This simply equals true diversity of order 2, i.e. the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest.
The index is also used as a measure of the effective number of parties.
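As a small worked example of the two transformed indices, the sketch below applies them to a hypothetical set of party seat shares. The shares and the function name are assumptions made for illustration and do not describe any real legislature.

```python
def simpson_index(shares):
    """Sum of squared shares (probability two random draws match)."""
    return sum(s ** 2 for s in shares)

seat_shares = [0.4, 0.3, 0.2, 0.1]   # hypothetical shares summing to 1

lam = simpson_index(seat_shares)
inverse_simpson = 1.0 / lam          # effective number of types (order 2)
gini_simpson = 1.0 - lam             # probability two draws differ
```

With these shares the Simpson index is about 0.30, so the effective number of parties is roughly 3.3 even though four parties hold seats, and the Gini–Simpson index is about 0.70.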